PSCI 2270 - Week 3
Department of Political Science, Vanderbilt University
September 12, 2023
Learning about Population from Sample
Descriptive Statistics
Necessary Math
Types of Data Collection
We often cannot survey or measure outcomes for the whole set of units we are interested in \(\Rightarrow\) Target population
We then have to settle for a subset of units for which we can reasonably collect data \(\Rightarrow\) Sample
We draw the sample from an available list that ideally includes the whole population \(\Rightarrow\) Sampling frame
Simple random sampling: Every unit has an equal selection probability
e.g. random digit dialing (RDD):
Non-probability sampling: e.g. Opt-in Internet panels
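A minimal sketch of simple random sampling, assuming a made-up sampling frame of 1,000 unit IDs (the frame, the helper name `simple_random_sample`, and the seed are illustrative, not part of the lecture):

```python
import random

# Hypothetical sampling frame: ID numbers for 1,000 units in the population.
sampling_frame = list(range(1, 1001))

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

sample = simple_random_sample(sampling_frame, n=50, seed=2270)
print(len(sample), len(set(sample)))  # 50 distinct units
```

Because the draw is without replacement and uniform over the frame, each unit has the same selection probability of 50/1000.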
Literary Digest predicted elections using mail-in polls
Source of addresses: automobile registrations, phone books, etc.
In 1936, sent out 10 million ballots, over 2.3 million returned
| Pollster | FDR’s Vote Share |
|---|---|
| Literary Digest | 43% |
| George Gallup | 56% |
| Actual Outcome | 62% |
Ballots skewed toward the wealthy (with cars, phones) \(\Rightarrow\) selection bias
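To see how a skewed sampling frame can produce this kind of error, here is a small simulation. All numbers are invented for illustration and are not the 1936 data: it assumes 3,000 of 10,000 voters appear on the list (the "frame"), and that listed voters support the candidate at a lower rate than unlisted voters.

```python
import random

rng = random.Random(1936)

# Invented population of 10,000 voters (NOT the 1936 data).
# Each voter is a pair: (in_frame, supports_candidate).
population = (
    [(True, True)] * 1200 + [(True, False)] * 1800      # 3,000 voters on the list
    + [(False, True)] * 4900 + [(False, False)] * 2100  # 7,000 voters missing from it
)

def support_share(voters):
    """Fraction of voters who support the candidate."""
    return sum(supports for _, supports in voters) / len(voters)

true_share = support_share(population)  # 6,100 / 10,000 = 0.61

# Sampling only from the list versus sampling from the whole population.
frame = [voter for voter in population if voter[0]]
biased_sample = rng.sample(frame, 1000)
random_sample = rng.sample(population, 1000)

print("true share:      ", true_share)
print("frame-only poll: ", support_share(biased_sample))
print("random sample:   ", support_share(random_sample))
```

In this setup the frame-only poll lands near 0.40 no matter how large the sample is, while the random sample lands near the true 0.61: a bigger sample does not repair a biased frame.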
| Pollster | Truman | Dewey | Thurmond | Wallace |
|---|---|---|---|---|
| Crossley | 45% | 50% | 2% | 3% |
| Gallup | 44% | 50% | 2% | 4% |
| Roper | 38% | 53% | 5% | 4% |
| Actual Outcome | 50% | 45% | 3% | 2% |
Quota sampling:
Potential unobserved confounding \(\Rightarrow\) selection bias
Republicans easier to interview within quotas (phones, listed addresses, etc.)
Descriptive (summary) statistics are numerical summaries of those measurements
Two salient features of a variable that we want to know:
\[ \color{#98971a}{\bar{x}} = \color{#d65d0e}{\frac{1}{n}} \color{#458588}{\sum_{i = 1}^{n} x_{i}} \]
What’s all this notation?
Applied to the mean:
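As a small illustration of the summation notation, the sketch below computes the mean piece by piece for five made-up values (the data are an assumption chosen so the arithmetic is easy to check by hand):

```python
# Five hypothetical measurements x_1, ..., x_5.
x = [2, 4, 4, 5, 10]

n = len(x)         # n: number of observations
total = sum(x)     # sum over i = 1..n of x_i  -> 2 + 4 + 4 + 5 + 10 = 25
x_bar = total / n  # (1 / n) * sum             -> 25 / 5 = 5.0

print(x_bar)
```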
Median more robust to outliers:
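A quick check of this claim, using made-up values: replacing one observation with an extreme value moves the mean a great deal but leaves the median where it was.

```python
import statistics

# Hypothetical data, without and with one extreme value.
typical = [30, 35, 40, 45, 50]
with_outlier = [30, 35, 40, 45, 5000]

mean_typical = statistics.mean(typical)
mean_outlier = statistics.mean(with_outlier)
median_typical = statistics.median(typical)
median_outlier = statistics.median(with_outlier)

print("mean:  ", mean_typical, "->", mean_outlier)      # 40 -> 1030
print("median:", median_typical, "->", median_outlier)  # 40 -> 40
```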
Quantile (quartile, quintile, percentile, etc):
Interquartile range (IQR): a measure of variability
One definition of outliers: values more than 1.5 × IQR above the upper quartile or below the lower quartile
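The 1.5 × IQR rule can be sketched as follows. The data are made up, and the quartiles come from Python's `statistics.quantiles` with its default method; other software may use slightly different quartile conventions and so give slightly different fences.

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

# Quartiles: cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences 1.5 x IQR beyond the lower and upper quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [value for value in data if value < lower_fence or value > upper_fence]
print("IQR:", iqr, "fences:", (lower_fence, upper_fence), "outliers:", outliers)
```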
\[ \text{sd} = \color{#cc241d}{\sqrt{\color{#b16286}{\frac{1}{n - 1}} \color{#98971a}{\sum_{i = 1}^{n}} \color{#458588}{(}\color{#d65d0e}{x_i - \bar{x}}\color{#458588}{)^2} }} \]
Steps:
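One way to read the standard deviation formula from the inside out, sketched on five made-up values (the data and variable names are illustrative assumptions):

```python
import math
import statistics

x = [2, 4, 4, 5, 10]  # hypothetical measurements
n = len(x)

x_bar = sum(x) / n                             # mean: 5.0
deviations = [xi - x_bar for xi in x]          # x_i - x_bar: -3, -1, -1, 0, 5
squared = [d ** 2 for d in deviations]         # (x_i - x_bar)^2: 9, 1, 1, 0, 25
sum_of_squares = sum(squared)                  # 36
variance = sum_of_squares / (n - 1)            # divide by n - 1: 9.0
sd = math.sqrt(variance)                       # square root: 3.0

print(sd, statistics.stdev(x))  # the library function uses the same n - 1 formula
```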
Probability:
Law of Large Numbers
Central Limit Theorem:
In real data, we will have a set of n measurements on a variable: \(X_1\) , \(X_2\), … , \(X_n\)
Empirical analyses: sums or means of these n measurements
Law of Large Numbers (LLN)
Let \(X_1\), … , \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then the sample mean \(\bar{X}_{n} = \frac{1}{n} \sum_{i = 1}^{n} X_{i}\) converges (in probability) to \(\mu\) as \(n\) gets large.
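A simulation sketch of the LLN, assuming i.i.d. rolls of a fair six-sided die (an illustrative choice; its mean is \(\mu = 3.5\)). The sample mean is computed for increasingly large \(n\):

```python
import random

rng = random.Random(2270)  # fixed seed so the run is reproducible

# 100,000 i.i.d. rolls of a fair die; the population mean is 3.5.
rolls = [rng.randint(1, 6) for _ in range(100_000)]

# Sample mean of the first n rolls, for growing n.
running_means = {}
for n in (10, 100, 1_000, 10_000, 100_000):
    running_means[n] = sum(rolls[:n]) / n
    print(n, round(running_means[n], 3))
```

The early means can be noticeably off, but they settle close to 3.5 as \(n\) grows.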
The normal distribution is the classic “bell-shaped” curve.
Three key properties:
Central Limit Theorem (CLT)
Let \(X_1\), … , \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then \(\bar{X}_n\) is approximately distributed \(N ( \mu, \sigma^2 / n )\) in large samples.
The approximation improves as \(n\) grows \(\Rightarrow\) asymptotics
“Sample means tend to be normally distributed as samples get large.”
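A simulation sketch of the CLT, assuming draws from an exponential distribution with mean 1 and variance 1 (a deliberately skewed, non-normal choice made for illustration). We repeat the "take a sample, compute its mean" step many times and look at how the sample means are spread:

```python
import math
import random
import statistics

rng = random.Random(2270)  # fixed seed so the run is reproducible
n = 50       # size of each sample
reps = 2000  # number of repeated samples

# Each entry is the mean of one sample of n exponential draws (mu = 1, sigma^2 = 1).
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
clt_sd = 1 / math.sqrt(n)  # sqrt(sigma^2 / n) with sigma = 1

print("mean of sample means:", round(mean_of_means, 3), "(CLT: 1.0)")
print("sd of sample means:  ", round(sd_of_means, 3), "(CLT:", round(clt_sd, 3), ")")
```

Even though individual draws are strongly skewed, the sample means center on \(\mu\) with spread close to \(\sigma / \sqrt{n}\); a histogram of `sample_means` would look roughly bell-shaped.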
We usually have only one sample, so we only get one sample mean. So why do we care about the LLN and CLT?
\[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
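In practice \(\sigma\) is unknown, so a common plug-in approach replaces it with the sample standard deviation. A minimal sketch on five made-up values (the data and the helper name `standard_error` are illustrative assumptions):

```python
import math
import statistics

def standard_error(values):
    """Estimated standard error of the mean: sample sd divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

x = [2, 4, 4, 5, 10]    # hypothetical sample with sd = 3 and n = 5
se = standard_error(x)  # 3 / sqrt(5)

print(round(se, 3))
```

Because \(n\) sits under a square root, quadrupling the sample size only halves the standard error.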